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Monte Carlo sampling methods often suffer from long correlation 
times. Consequently, these methods must be run for many steps to 
generate an independent sample. In this paper a method is proposed 
to overcome this difficulty. The method utilizes information from 
rapidly equilibrating coarse Markov chains that sample marginal dis- 
tributions of the full system. This is accomplished through exchanges 
between the full chain and the auxiliary coarse chains. Results of nu- 
merical tests on the bridge sampling and filtering/smoothing prob- 
lems for a stochastic differential equation are presented. 

1. Introduction. In spite of substantial effort to improve the efficiency 
of Markov chain Monte Carlo (MCMC) methods, spatial correlations remain 
a major impediment. These correlations can severely restrict the possible 
configurations of a system by imposing complicated relationships between 
variables. It is well known that judicious elimination of variables by renor- 
malization can reduce long range correlations (see [U |2]). The remaining 
variables are distributed according to the marginal distribution, 



where ir (x, y) is the full distribution. Given the values of the x variables and 
the marginal distribution 7f the y variables are distributed according to the 
conditional distribution 



For systems exhibiting critical phenomena, the path through the space of 
distributions taken by marginal distributions under repeated renormaliza- 
tion can yield essential information about critical indices and the location 
of critical points (see [TJ [2]). More generally, because these marginal dis- 
tributions exhibit shorter correlation lengths and weaker local correlations, 
they are useful in the acceleration of Markov chain Monte Carlo methods. 
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As explained in the next section, parallel marginalization takes advantage 
of the shorter correlation lengths present in marginal distributions of the 
target density. 

The use of Monte Carlo updates on lower dimensional spaces is not a 
new concept. In fact this is a necessary procedure in high dimensions. One 
simply constructs a chain with steps that preserve the conditional probabil- 
ity density of the full measure. This is usually accomplished by perturbing 
a few components of the chain while holding all other components of the 
chain constant. In other words the chain takes steps of the form 

Y n+1 = (xi, Xi-i,Xi + e, x i+1 , ...,x d ) 

where 

Y n = ( Xl ,...,x d ) 

and the move preserves ir(xi\xi, . . . , a?i_i, x^+i, . . . , x d )- There have been 
many important attempts to use proposals in more general sets of projected 
coordinates. The multi-grid Monte Carlo method presented in [31 S] is one 
such method. These techniques do not incorporate marginal densities. 

In [5j, Brandt and Ron propose a multi-grid method which approximates 
successive marginal distributions of the Ising model and then uses these ap- 
proximations to generate large scale movements of the Markov chain sam- 
pling the full joint distribution of all variables. Their method, while demon- 
strating the efficacy of incorporating information from successive marginal 
distributions, suffers from two limitations. First, the method used to ap- 
proximate the marginal distributions is specific to a small class of problems. 
For example, it cannot be easily generalized to systems in continuous spaces. 
Second, information from the approximate marginal distributions is adopted 
by the Markov chain in a way which does not preserve the target distribution 
of all variables. 

The design of a generally applicable method which approximates the 
marginal distributions was addressed in [6l [7] by Chorin, and in [8] by Sti- 
nis. Both authors approximate the renormalized Hamiltonian of the system 
given by the formula, 

Ti(x) = - log J vr (x, y) dy. 

Thus exp (^— TL (x)^j is the marginal distribution of the x variables. Chorin 

determines the coefficients in an expansion of TC (x) by first expanding the 

derivatives , which can be expressed as conditional expectations with 
respect to the full distribution. Stinis shows that a maximum likelihood 
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approximation to the renormalized Hamiltonian can be found by minimizing 
the error in the expectations of the basis functions in an expansion of 7i (x). 
For applications of related ideas to MCMC simulations see jU] and [TU] . 

Two Parallel marginalization algorithms are developed in the next section 
along with propositions that guarantee that the resulting Markov chains 
satisfy the detailed balance condition. In the final section the conditional 
path sampling problem is described and numerical results are presented for 
the bridge sampling and smoothing/filtering problems. A brief introduction 
to parallel marginalization can be found in 

2. Parallel marginalization. In this section, it is assumed that appro- 
priate approximate marginal distributions are available. How to find these 
marginal distributions depends on the application and will be discussed 
here only in the context of the examples presented in this paper. A new 
Markov chain Monte Carlo method is introduced which uses approximate 
marginal distributions of the target distribution to accelerate sampling. Aux- 
iliary Markov chains that sample approximate marginal distributions are 
evolved simultaneously with the Markov chain that samples the distribution 
of interest. By swapping their configurations, these auxiliary chains pass 
information between themselves and with the chain sampling the original 
distribution. 

Assume that the system of interest has a probability density, 7To(£o)j where 
xq lies in some space E. Suppose further that, by the Metropolis-Hastings 
or any other method (see [12 ), one can construct a Markov chain, Y n £ E, 
which has ttq as its stationary measure. That is, for two points xoi2/o e ^ 



where ro(yol^o) is the probability density of a move to |^o™ + = 2/0 j given 
that {Yq 1 = xo}. Here, n is the algorithmic step. 

In order to take advantage of the shorter spatial correlations exhibited 
by marginal distributions of ttq, a collection of lower dimensional Markov 
chains which approximately sample marginal distributions of ttq is consid- 
ered. Suppose the random variable Xo has cfo components. Divide these into 
two subsets, 



where Xq has d\ components and Xo has d$ — d\ components. Recall that 
the Xq variables are distributed according to the marginal density, 





(1) 
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and that given the value of the Xo variables, the Xq variables are distributed 
according to the conditional density, 

(2) Tr{x \x ) = _ 

no{x ) 

Label the domain of the Xq variables E\ . Suppose further that an approxi- 
mation to the marginal distribution of the Xo variables, 

7Q (xo) pa 7fo (#o) 

is available. The sense in which tti approximates 7fo is intentionally left 
vague. In applications of parallel marginalization the accuracy of the ap- 
proximation manifests itself through an acceptance rate. 

Now let X\ £ E\ be independent of the Xo random variables and drawn 
from 7Ti (xo). Notice that X\ represents the same physical variables as Xq 
though its probability density is not the exact marginal density. Continue 
in this way to remove variables from the system by decomposing Xi G E\ 
into proper subsets as 

Xi = (x h xi) 

and defining Xi + \ £ Ei + \ to be independent of the {Xo, . . . ,X[} random 
variables and drawn from an approximation tti + \ to tti (xi). Clearly each 
Xi + i represents fewer physical variables than X\. 

Just as one can construct a Markov chain Yg n G Eq to sample Xq, one 
can also construct Markov chains Y™ S E[ to sample tti. In other words, for 
each Yf 1 choose a transition probability density 77, such that 



J n(yi\xi)iri (xi) dxi = tti (yi) 



for all i. 

The chains Y t n can be arranged in parallel to yield a larger Markov chain, 

Y n = (Y n ,...,Y£)£E x---xE L . 

The probability density of a move to |y n+1 = y} given that {Y n = x} for 
x, y G Eo x • • • x El is given by 

L 

(3) r(y\x) = Y[n{yi\xi). 



1=0 



Since 



/ \ T (y\ x ) (xi) \ dx ...dx L = Y[ni(yi) 
J \ 1=0 ) 1=0 
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the stationary distribution of Y n is 

II (x , ...,X L )=TT (xq) . . . 7T£ (x l ) . 

The next step in the construction is to allow interactions between the 
chains Yj n and to thereby pass information from the rapidly equilibrating 
chains on the lower dimensional spaces (large I) down to the chain on the 
original space (I = 0). This is accomplished by swap moves. In a swap move 
between levels I and l+l, a subset, x,\ 6 £7+1, of the x\ variables is exchanged 
with the Xi + \ £ variables. The remaining X\ variables are resampled 

from the conditional distribution 717 (xi\xi+i). For the full chain, this swap 
takes the form of a move from {Y n = x} to |y n+1 = y} where 

x = (...,Xl,X h Xl +1 ,...) 

and 

y=(.. . ,x i+1 ,yi,xi, ...) . 

The yi variables are drawn from iri (xi\xi + i) and the ellipses represent com- 
ponents of Y n that remain unchanged in the transition. 

If these swaps are undertaken unconditionally, the resulting chain may 
equilibrate rapidly, but will not, in general, preserve the product distribution 
II. To remedy this the swap acceptance probability 

/a \ , ■ J, ^l(xi + l)7Tl + 1 (xi) 

( 4 M = mm<^ 1, — — . 

I 7ri{xi)7ri + i{xi + i) 

is introduced. Recall that ~Wi is the function resulting from the integration 
of 7r; over the x\ variables as in equation ([T]). Given that {Y n = x}, the 
probability density of {y n+1 = y} 5 after the proposal and either acceptance 
with probability A\ or rejection with probability 1 — Ai, of a swap move, is 
given by 



^ (y\x) = (1 - Ai) l[S {y . =x 



+ Ai ni(yi\xi +1 ) S mm+l)={xi+lA)} ]J 5 



{Vj= x j} 

for x,y £ Eq x ■ ■ ■ x El. 5 is the Dirac delta function. 
We have the following proposition. 

Proposition 1. The transition probabilities ipi satisfy the detailed bal- 
ance condition for the measure II, i.e. 

Ii(x) ipi (y\x) = Ii(y) ipi (x\y) 

where 2, y G Eq x ■ • • x El. 
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Proof. Fix x, y £ Eq x • • • x El such that x / y. 

n(x) Vz (y\x) = Yi n i 5 {yj=X]} n i ( x i) ^+1 Om) 

X (jl-Ai) 8{(y l ,y l+1 )=(x l ,x l+1 )} + M TTl(yi\xi +1 ) ^{( yi ,y l+1 )=(xi +1 ,xi)}j 

When x ^ y (IL(x)ipi (y\x)) and (Il(y)ipi (x\y)) are both zero unless Xj — 
yj for all j except I and / + 1 and (yi,yi+i) = (x;+i,x;). Therefore it is 
enough to check that the function 

R((x h x l+1 ) ,{yi,yi+i)) = 717(3^)717+1 (xi +1 ) iri (yi\x l+ i) A t 

is symmetric in (x u x i+ i) and {y h y l+1 ) when (yi,y i+1 ) = (x i+1 ,xi). Plugging 
in the definition of Ai, 

r> ( \ t \ (~\ \ ■ /i ^(^+1)^+1(^)1 

R = tti (xi) 7T i+ i (xi +1 ) tti (yi\xi + i) mmM, — — )■ 

I 7T/ (37)717+1 (x;+i) J 

Rearranging terms gives, 

R = 1Ti (37) 717+1 (37+1) 7T; (y/|x i+ i) TT^ (x i+ l) 7T / + 1 (y Z+1 ) 

1 1 

x mm 



Kl{%l+l)m+i(yi+i) ' Ki(yi + i)iri + i(xi + i) 
Recall from that 717(^137+1)717(37+1) = 717(37+1, y{). Therefore, since 

xi+i = m, 

R = n (x/) 7r/+i (xi+x) 717 (y{) 717+1 (y l+1 ) 



x mm 



1 1 



717 (37+1) 717+1 (y m ) ' 7ri(yi+i)n+i(xi+i) 



The final formula is clearly symmetric in (37,37+1) and (yi,yi+i). □ 

The detailed balance condition stipulates that the probability of observing 
a transition x — > y is equal to that of observing a transition y — > x and 
guarantees that the resulting Markov Chain preserves the distribution n. If 
the chain is also Harris recurrent then averages over a trajectory of Y n will 
converge to averages over n. In fact, chains generated by swaps as described 
above cannot be recurrent and must be combined with another transition 
rule to generate a convergent Markov chain. Since 

tto(xo) = J n(x , ...,xl) dxi... dx L , 
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if Y n is Harris recurrent with invariant distribution II, averages over ttq 
can be calculated by taking averages over the trajectories of the first do 
components of Y n . 

2.1. Approximation of acceptance probabilities. Notice that the formula 
@ for Ai requires the evaluation of 717 at the points S -E7+1. While 

the approximation of Wi by functions on £7+1 is in general a very difficult 
problem, its evaluation at a single point is often not terribly demanding. In 
fact, in many cases, including the examples in Chapter 3, the Xi variables can 
be chosen so that the remaining X\ variables are conditionally independent 
given X h 

Despite this mitigating factor, the requirement that Wi be evaluated be- 
fore acceptance of any swap is inconvenient. Fortunately, and somewhat 
surprisingly, this requirement is not necessary. In fact, standard strategies 
for approximating the point values of the marginals yield Markov chains 
that also preserve the target measure. Thus even a poor estimate of the ra- 
tio appearing in Q can give rise to a method that is exact in the sense that 
the resulting Markov chain will asymptotically sample the target measure. 

Before moving on to the description of the resulting Markov chain Monte 
Carlo algorithms consider briefly the general problem of evaluating marginal 
densities. Let pi(x,y) and P2(x,y) be the densities of two equivalent mea- 
sures with marginal densities, 



p x {x) = I p x (x,y)dy 

and 



Pii x ) = / P2(x,y)dy 
respectively. For any integrable function 7(2;, y), 

E P1 [7 (X, Y) p 2 (X, Y) I {X = x}} = J 7 (x, y) P2 (x, y) P i(y\x)dy 

^ P2( x ) 

p x (x) 



l{x,y)p2(y\x)pi{x,y)dy 
E P2 [ 7 (X,Y) P1 (X,Y)\{X = x}\ 



p x (x 

Thus given p~2(x) , the value of p x at x can be obtained through the formula 



(5) Pi(x)=p 2 {x) 



E P2 h(X,Y) Pl (X,Y)\{X = x}] 
E P1 h(X,Y) P2 (X,Y)\{X = x}\ 
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Of course, the usual importance sampling concerns apply here. In particular, 
the approximation of the conditional expectations in §5§ will be much easier 
when Y lives in a lower dimensional space. 

Similar approximations can be inserted into our acceptance probabilities 
Ai in place of the ratio ■ For example, \ipx(x\\xi) is a reference density 

approximating iri(xi\xi), then the choice 



l(xi,oti) 



1 



Pl{xi,xi) 



yields 



(6) 



TTl(x) -Pl(x)—J2 



vi(xi,V j ) 

Pi {X, V 3 ) 



1 

M 



E 



Pi{V 3 \xi) 



where the V 3 are samples from pi(xi\&i). Thus if U 3 are samples from 
pi(xi\x i+1 ), then 



M 2^j=l pl (UJ\ Xl+1 ) 

J_ V^M TTl{x U Vi) 

M 1-3=1 p<(VJ|x<) 



E 



E 



in 



(Xi\ 



xi 



{x t = xi) 



Ki{xi) 



In the numerical examples presented here, pi{ ■ \x{) is a Gaussian approxi- 
mation of tti(xi\xi). How is chosen depends on the problem at hand (see 
numerical examples below). In general pi{ ■ \xi) should be easily evaluated 
and independently sampled, and it should "cover" 7rj( • \xi) in the sense 
that regions where 717 ( • \x{) is not negligible should be contained in regions 
where pi{ ■ \xi) is not negligible. In the case mentioned above that the Xi 
variables can be chosen so that the remaining X\ variables are conditionally 
independent given X\ the conditional density iri(xi\xi) can be written as a 
product of many low dimensional densities. As mentioned above, the prob- 
lem of finding a reference density for importance sampling is much simpler 
in low dimensional spaces. 

The following algorithm results from replacing A\ in Q with approxi- 
mation of the form (|6]). Assume that the current position of the chain is 
jyn — x y w h ere 

x = (. . . ,xi,xi,x t+ i, ...) . 
Algorithm [l] will result in either |y n+1 = x} or |y ra + 1 = y} where 

y = (...,xi +1 ,y h xi,...) 



and yi is approximately drawn from iri (xi\xi^ 
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Algorithm 1 (Parallel Marginalization 1). The chain moves from Y n 
to Y n+1 as follows: 

1. Let U 3 for j = 1, . . . , M be independent random variables sampled 
from pi( • \xi + i) (recall that the swap is between xi and xi + \ which are 
both in Ei + i). 

2. Evaluate the weights 

j = n (xi+uW) 

u P i(ui\x l+1 y 

The choice of pi made above affects the variance of these weights, and 
therefore the variance of the acceptance probability below. 

3. Draw the random index J £ {1, . . . , M} according to the probabilities 

Z^m=l VV U 

Set Y' = U J . Notice that Y' is an approximate sample from tti{ • 
4- Let V J = x\ and draw V 3 for j / J independently from pi{ ■ \xi), 
Notice that the U 3 variables depend on xi + \ while the V 3 variables 
depend on X[. 

5. Define the weights 

j = in {x h yj) 

V Pi(Vi\ Xl ) 

6. Set 

Y n+1 = (...,x l+1 ,Y',x h ...) 

with probability 

I 7ri + i(x J+ i)X)j=iWvJ 

and 

Y n+1 = Y n = (..., x h x h x l+l ,...) 
with probability 1 — Af 1 . 

The transition probability density for the above swap move from x to y 
for x, y G Eq x • • • x Ei,\s given by 

W^yk) = P [{Swap is rejected}] n^te=^} 

+ P [{Swap is accepted} D = y/j] 

x ${{yi,yi+i)=(xi+i,xi)} II ^{%=zj}> 
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where 5 is again the Dirac delta function. Notice that to find the proba- 



one must integrate over 



bility density P {Swap is accepted} n jV' = y\\ 

the possible values of the U J and V J variables. Since tti appears in the in- 
tegrand it is not possible, in general, to evaluate the integral. However, as 
indicated in the proof of the next proposition, it is not necessary to evaluate 
this density to show that the method converges. 

While the preceding swap move corresponds to a method for approximat- 
ing the ratio 

ni{xi +l ) 

appearing in formula Q for A\, it also has similarities with the multiple-try 
Metropolis method, presented in [131 E] > that uses multiple suggestion sam- 
ples to improve acceptance rates of standard MCMC methods. In fact the 
proof of the following proposition is motivated by the proof of the detailed 
balance condition for the multiple try method. 

Proposition 2. The transition probabilities ipf 1 satisfy the detailed bal- 
ance condition for the measure II. 

PROOF. For x,y G Eq x • • • x El such that x ^ y, 
n(x) V/ M (yk) = n(x) P {Swap is accepted} n = m} 

As in the previous proof it can be assumed that Xj = yj for all j except I 
and I + 1 and (yi,yi + \) = (xi + \,xi). Since in this case ir(xj) = ir(yj) for all 
j ^ {/, I + 1} , it remains to show that if (yi,yi+i) = (sj+i, x{) then 

R ((xi,xi+i) , {yu yi+i)) = n (xi) iri +1 (xi+i) 



X P 



{Swap is accepted} n |^ n+1 = yi\ 



is symmetric in (xi, xi + \) and (yz, yi+i) - Define a random index J £ {1, . . . , M} 
by the relation yi = U J . Then, since the are i.i.d., 

{Swap is accepted} D (y' = yij = 

M 

^2 p [{Swap is accepted} n |yz = W \ n { J = j} 

{Swap is accepted} D |yz = U 1 ^ n {J = 1} 



M P 



imsart-aos ver. 2007/04/13 file: weare07cv2.tex date: February 5, 2008 



11 



Thus, 

R ((xi,xi + i) , (yi,y i+ i)) = M 717 (27) 717+1 (arz+i) 



x P 



{Swap is accepted} n |y; = L rl |n{J = l} 



Writing out the density on the right of this relation gives, 



R = M m (xi) 717+1 (xi+x) y^minjl, 



n+l{Xi) }^ j=1 W u | p( u i|a; I+1 ) 



x p ^u 1 |a;j+ij Y\.P \ u ''\ x i+i) P (y J du ] dv 3 
Replacing u 1 by and rearranging gives, 

R = M TTi (x t ) 7T/+1 7T; (zj+i, 2/z) 7Tj + i (x;) 

,1 



nun 



]^[ jj ^u J |x/ + i^ p ru- 7 |icH du 3 dv 3 . 



X 

J>1 



Since 27+1 = yi, n(xi+l,Vl) = n(yi)- Therefore, after replacing xi by yi +1 , 

R = M 717 (xi) 717+1 (xi+i) n (yi) 717+1 (yi+i) 

1 1 

x / min 



n+i{yi + i)Y,f=iWh 7r l+1 (x l+1 )j:f =1 w^ 

Y[ P (u>\xi + i\ p (v j \yi + i \ du j dv j . 



x 



which is symmetric in (27,27+1) and (yj,y/+i). □ 



For small values of M in ( 13 ) , calculation of the swap acceptance probabil- 
ities is very cheap. However, higher values of M may improve the acceptance 
rates. For example, if the 717 for I > are exact marginals of ttq, then A\ = 1 
while Af 1 < 1. In practice one has to balance the speed of evaluating Af* 
for small M with the possible higher acceptance rates for M large. 

In analogy again with the multiple-try method, the above algorithm can 
be generalized to include correlated samples U 3 and V 3 . This generalization 
is useful because it allows reference densities that cannot be independently 
sampled. Again consider a transition from {Y n = x} where 

x = (. . .,Xl,Xl,Xl+l, . . . ) 
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to either {y n+1 = x} or |y n+1 = where 

y= (...,xi +1 ,y h xi,...) . 

First choose some reference transition densities pj (u J \ (u°, . . . , u 1 " 1 ) , x{) 
that sample a variable U J given the previous j — 1 samples and the value of 
the xi variables. Let 
(8) 

pj((u k+1 ,...,u 3 )\(u°,...,u k ), xi) = 1] Pr{u m \{u^... jU m - 1 ), xi). 

k<m<j 

For example, one might choose the pj to be Markov transition kernels associ- 
ated with some Markov chain Monte Carlo method with stationary measure 
tti(xi\xi). Also let A- 7 ((u°, . . . , u 3 ), x/,x; + i) be any function satisfying the 
relation 

A J ((u°, . . . ,u 3 ), xi,x i+ i) = X J ((V, . . . ,u°), x i+ i,xi) 
Algorithm 2 (Parallel Marginalization 2). We move the chain from 

yn tQ yn+1 ag f U ows: 

1. For j = 1,...,M sample U 3 from pj ( • {(x^U 1 , . . . , 7J- 7-1 ), x\ +1 ) . 
Notice the conditioning on the value X\ = xi + \. 

2. Define the weights 

^=7T i (^,X m )p/((^- 1 ,...,[/ 1 ,X / ) \U 3 , xi) 

x X 3 ({xuU 1 , . . . ,U 3 ) , xi,xi+i) . 

Notice the reversal in the ordering of the U 3 and the conditioning on 
Xi = x t . 

3. Choose the random index J € {1, . . . , M} according to the probabilities 



v[J = j] 



u 



m=l W U 

Set Y' = U J . 

4. Let V J = xi and for j = 1,...,J - 1 let V 3 = U J ~ 3 . For j = 
J + 1, . . . , M sample V 3 from pj ( • \(Y', V 3 " 1 ), xi) . Notice the 

conditioning on the value X\ = x/. 

5. Define the weights 

wl = m (y*,xi) P j ({Vi-\ V\Y')\V 3 \ x l+ i) 
xA 7 ((y',n...,n x l+1 ,xi). 
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6. Set 

Y n+1 = (..., x l+u Y',x h ...) 

with probability 

M AM ■ U ^+l(^)Ejll^ 
(9) A, = mm.{ 1, J —r i 

and 

Y n+l = Y n = {..., x u x u x l+l ,...) 
with probability 1 — A M . 

The transition probability density for the above swap move from x to y 
for x, y £ Eq x • • • x E^is again given by 



ipl^(y\x) = P [{Swap is rejected}] J^c) 



{yj= x j} 



+ P 



{Swap is accepted} D |y' = yzj 



x <W,w + i)=to+iA)} n 6 



{Vj= x j} 

where and 5 is again the Dirac delta function. Again, the density 
P {Swap is accepted} n VY' = yij cannot and need not be evaluated. 
Algorithm [T] can be derived from Algorithm [2] by setting 

p\ (u j \(u°, . . . ,u 3 ~ 1 ), xi"j =pi (u J \xi"j 

and 

X 3 ((u°,...,u 3 ), x h xi +1 ) = — — w -. 

Notice also that if 



A j ((u°, . . . ,u j ),x h xi +1 



q 3 ((u 1 , ...,u 3 l )\x t ,x l+1 ) 



pj((vj 1 ,...,u°)\vJ, %i) p\ ({u 1 , . . . ,vj)\u° , 
where, for each j, q 3 is a conditional density satisfying 

q 3 ({u l ,...,u 3 ~ l )\xux l+ ^j =q 3 ([u 3 " 1 , . . . , v})\xi +1 , Xij 
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then 



E j 

n 



f j 

/ 7ri(u j ,xi + i)ct> ((mi,.. . ,Uj-i)\xi,x i+ i) \\_du 1 = Tfi(x w , 

i=l 



Thus, if the pj generate a sequence that satisfies a Law of Large Numbers, 
then jj Wjj — > TTi(xi + i). The same holds for the W v so that 

A x -> mm<^ 1, — — \ = Ai. 

I TTi{xi)7ri +1 (xi + i) J 

More general choices of A J lead to Af 1 which converge to correpondingiy 
more general acceptance probabilities than A\. 

Of course, expression ([5]) points the way to even more general algorithms. 
Algorithms [T] and [2] correspond to choices of 7 in ^ that make the con- 
ditional expectation on the bottom of ^ equal to one. Other choices of 7 
may improve the variance of the resulting weights. 

Proposition 3. The transition probabilities ipf 1 satisfy the detailed bal- 
ance condition for the measure II. 

Proof. Fix x,y 6 EqX- ■ -xEl such that x ^ y. For x,y G EqX ■ ■ • X El 
such that x y, 

II(x) ^/ M (y|x) = II(x) P {Swap is accepted} n (y' = y z }] 

As in the previous two proofs it can be assumed that Xj = yj for all j except 
I and I + 1 and (yi,yi + i) = (scj.fi, xi)- Since in this case ir(xj) = n(yj) for all 
j £ {/, I + 1} , it remains to show that if (yi,yi+i) = (scj+i, £j) then 

-R ((aij, sci+i) , (yj, = vr^ (x z ) vr m 

x P {Swap is accepted} H {y 7 = jfj j 

is symmetric in (xi,xi + \) and (yi,yi + i). Summing over disjoint events, 
P {Swap is accepted} fl |y' = yA = 

M 

p [{Swap is accepted} n |yj = U j \ n {J = j} 
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Thus R will be symmetric if for each j the function 

Rj ((xi,xi +1 ), (yi,y i+1 )) = m (xi) vr i+ i (xi+i) 

{Swap is accepted} fl |K' = yi \ r\ {J = j} 



x P 

is symmetric. 

Rj ((xi,xi+i), (yi,y i+1 )) = m (xi) 717+1 (xi+i) 
x / min< 1 



7T I+ 1(X,)E^1^ \ W* 



%i(x !+ i)Ef=i^iEf=i< 

x <5(u J ' - - £,) f J] 5{u j - k -v k ) \ I J] du V 

\i<k<j J \k>i, k^j 

Recall the definition of the weights and the fact that U J = y\ , V J = x\ , and 

yj = jjJ-j for j = 1,..., J - 1 

W(j = ni (y h xi +1 )p\ • • • \yi, xtj 

x A J (Jxuu 1 , . . .,vP~ l ,yi) , x h x i+ ij 

= n {yuxi+i)p\ [(v 1 , . . . \m, fy) 

x X j (Jx^u 1 , . . .,u j ~ 1 ,yij , x u xi + ij . 

Thus, 

Rj ((xi,xi+i), (yi,yi+i)) = ni (xi) vr m (xi+i) 



Ki + i(xi)ElUw k 



x / mm i ; w fc r ^M TT/fc ^ (m+i) 



x . . .,v M )\(y h v\ . . .,vi), xt) V f (iu\ . . .,u M )\x h x l+1 ) 



x 5(u j - m)5(v j -xi)[ 6(u j - k - v k ) \ I JJ du k dv 

\l<k<j J Vol, k^j 

Definition ([8]) implies that for all j, 



A.- 



1 

Pi 



((u 1 ,...,^'- 1 )^, x / )pf f ((^ +1 ,...^ M )|(y^ 1 ,...,^- 1 ,x z ), x,) 

= pj*((t> l ,...,t^)|fc, *,), 
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Thus, 

Rj ((xi,x i+1 ), (yi,yi+i)) = 

Til (xi) m +1 (x,+i) / minj 1, u f 

717 (^xh-i) A J ((x^n 1 , . . . j/;) , xi,x i+1 ) 

x^jV,...,^, x)pf({u\...,u M )\x u x, +1 ) 

x <5(^' - y ; )<v - *i) f n <H nj ~ fc - ^ fc ) ) f n 

\i<fe<i / \fe>i, k^j 

which can be rewritten, 

Rj ((xi, xi+i), (yi, yi+i)) = tti (xi) vr i+ i (x l+1 ) 717 (y h x i+1 ) 717+1 (xi) 

X / mm {vr m (xOEf=i^' 7r m (x m )E^li^} 
xpt*((v\...,v M )\y h x)pf((u\...,u M )\x h x,+i) 

x <5(^' - - *i) f n - yk ) ] f n 

\l<k<j J \k>l, k^j 

Plugging yi + \ = xi and y/ = x;+i, into this expression yields, 
i?j ((x/, x i+ i), (y/, yi+i)) = tti (xi) 717+1 (x i+1 ) 717 (yi) vr z+ i (y i+1 ) 

r . r i i 

x / inin s 

J l%i(!/l + i)Ef = i^' vr m (x m )Ef=i^ 

x A J ((x^u 1 , . . . ,u J ~ 1 ,y^ , yz+i^+i) 
x^(V,...,^)|y,, ym )^((^,---^ M )l^ 



X 



l<fc<j / \fc>l, k^j 



By the symmetry property of Xj this expression is symmetric in (x;,x;+i) 
and (yi,yi+i). □ 
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Clearly a Markov chain that evolves only by swap moves cannot sample all 
configurations, ie. the chain generated by ip is not </>-irreducible for any non 
trivial measure eft. These swap moves must therefore be used in conjunction 
with a transition rule that can reach any region of space. More precisely, 
let r from expression Q be Harris recurrent with stationary distribution II 
(see [IS])- The the transition rule for parallel marginalization is 

Tpm(y\x) = (1 - a) r(y\x) + a J T(z\x)ip (y\z) dz 

where 

^) = Ek(yk) 

fc=0 L 

and a £ [0, 1) is the probability that a swap move occurs. r pm dictates that, 
with probability a, the chain attempts a swap move between levels I and 
7+1 where I is a random variable chosen uniformly from {0, . . . , L — 1}. 
Next, the chain evolves according to r. With probability 1 — a the chain 
moves only according to r and does not attempt a swap. The next result 
guarantees the invariance of IT under evolution by Tp m . 

It is not difficult to verify that the chain generated by r pm has invariant 
measure IT and is Harris recurrent if the chain generated by r has these 
properties. Thus by combining standard MCMC steps on each component, 
governed by the transition probability t, with swap steps between the com- 
ponents governed by tp, an MCMC method results that not only uses in- 
formation from rapidly equilibrating lower dimensional chains, but is also 
convergent. 

3. Numerical Examples. In this section I consider applications of 
parallel marginalization to two conditional path sampling problems for a 
one dimensional stochastic differential equation, 

(10) dZ(t) = / (Z(t)) dt + a (Z(t)) dW(t), 

where / and a are real valued functions of R. One must first approximate 
Z{t) by a discrete process for which the path density is readily available. Let 
to = 0, t\ = ^, . . . , tN = T be a mesh on which one wishes to calculate path 
averages. One such approximate process is given by the linearly implicit 
Euler scheme (a balanced implicit method, see |16j). 

X(n+1) =X(n) + f(X(n)) A 

(11) + (X(n + 1) - X(n)) f (X(n)) A+a (X(n)) f (n), 
X(0) = Z(0). 
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Here X(ri) is an approximation to Z at time t n . The reader should note that 



the rate of convergence of the above scheme to the solution of ( 10 ) would not 



be effected by the insertion in (11 ) of a non- negative constant in front of the 
/ term. The choice of 1 made here seemed to improve the stability of the 
resulting scheme for large values of A. The £(n) are independent Gaussian 
random variables with mean and variance 1 , and A = £ . N is assumed to 
be a power of 2. The choice of this scheme over the Euler scheme (see |17j ) 
is due to its favorable stability properties as explained later. It is henceforth 
assumed that X(t) instead of Z(t) is the process of interest. 

The first of the conditional sampling problems discussed here is the bridge 
sampling problem in which one generates samples of transition paths be- 
tween two states. This problem arises, for example, in financial volatility 
estimation where, given a sequence of observations, (z(sq), . . . , z{sk)) with 
{sj} C {ti} , the goal is to estimate the diffusion term a (assumed here to be 
constant) appearing in the stochastic differential equation. Since in general 
one cannot easily evaluate the transition probability between times Sj and 
Sj + i (and thus the likelihood of the observations) it is necessary to generate 
samples between the observations, 

V(j) = (X(j,l),...,X(j,N J )) 

where Nj = N (sj+i — Sj) — 1 (assumed to be an integer) and X(j, n) denotes 
the value of the process at time Sj + It is then easy to evaluate the 
likelihood of a path 

X(s ),V(0),...,X(s K ),V(K) 

given a particular value of the volatility, a. 

The filtering/smoothing problem is similar to the financial volatility ex- 
ample of the previous paragraph except that now it is assumed that the 
observations are noisy functions of the underlying process. For example, one 
may wish to sample possible trajectories taken by a rocket given somewhat 
unreliable GPS observations of its position. If the conditional density of 
the observations given the position of the rocket is known, it is possible to 
generate conditional samples of the trajectories. 

3.1. Bridge path sampling. In the bridge path sampling problem one 
seeks to approximate conditional expectations of the form 

E [g (Z(ti), Z(t N ^)) \{Z(Q) = z-},{Z(T) = z+}} 



where g is a real valued function, and Z(t) is solution to (10). 
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Without the condition Z(T) = z + above, generating an approximate sam- 
ple (X(0), . . . ,X(N)) path is a relatively straitforward endeavor. One sim- 
ply generates a sample of Z(0), then evolves ( |11[ ) with this initial condition. 
However, the presence of information about {Z(t)} t>0 complicates the task. 
In general, some sampling method which requires only knowlege of a func- 
tion proportional to conditional density of (X(l), . . . ,X(N — 1)) must be 



applied. The approximate path density associated with discretization (11) 
is 

(12) 7T (x (l), . . . , x (N - 1) \x o {0),x o {N)) 

JV-l 



( \ 

oc exp I - 22 V (xo(n),x (n + 1), A) J 
V fc=o / 



where 



V{x,y,A) 



(l-A/ (x)) (y - x) + Af (x) 



2 



2a 2 (x) A 



At this point the parallel marginalization sampling procedure is applied 
to the density ttq. However, as indicated above, a prerequisite for the use of 
parallel marginalization is the ability to estimate marginal densities. In some 
important problems homogeneities in the underlying system yield simplifi- 
cations in the calculation of these densities by the methods in [6| |8] . These 
calculations can be carried out before implementation of parallel marginal- 
ization, or they can be integrated into the sampling procedure. 

In some cases, computer generation of the {^i}i >0 can be completely 
avoided. The examples presented here are two such cases. Let 
S t = |o,2', 2(2'), 3(2'),..., JV} (recall N is a power of 2). Decompose 5, 

as Si U Si where 
and 



§, = {0,2(2 , ),4(2 , ),6(2 , ),...,tf} 



5 z = {2',3(2'),5(2'),7(2'),...,iV-2'}. 

In the notation of the previous sections, x\ = (xi,xi) where 
xi = {xi(n)\ <?, r _ . n and xi = {xi(n)\ ■ In words, the hat and tilde 

L v 7 1 n(Eoi \|U,-/V | ' L v ' J Tito; 

variables represent alternating time slices of the path. For all I fix xi(0) = z~ 
and xi(N) = z + . We choose the approximate marginal densities 



n ({^( n )}neSA{o,7V} \ x i(°)> x i(. N )) Qi \{ x l(n)} n 
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where for each I, qi is defined by successive coarsenings of (11). That is, 



(N/2 l -l 
- J2 v(x l (2 l k),x l (2 l (k + l)),2 l A 
k=0 

Since 717 will be sampled using a Metropolis-Hastings method with x(0) and 
x(N) fixed, knowlege of the normalization constants 

Zi(xi(0),Xi(N)) =Jqi J] d Xl {n) 



n£Si\{0,N} 



is unnecessary. 



Notice from (12) that, conditioned on the values of X(n — 1) and X(n + 
1), the variance of X{n) is of order A. Thus any perturbation of X(n) 
which leaves X(m) fixed for m 7^ n and which is compatible with joint 
distribution (jl 21) must be of the order yfK. This suggests that distributions 



defined by coarser discretizations of (12) will allow larger perturbations, and 



consequently will be easier to sample. However, it is important to choose a 
discretization that remains stable for large values of A. For example, while 
the linearly implicit Euler method performs well in the experiments below, 
similar tests using the Euler method were less successful due to limitations 
on the largest allowable values of A. 

In this numerical example bridge paths are sampled between time and 
time 10 for a diffusion in a double well potential 

f{x) = —Ax (x 2 — lj and a(x) = 1 

The left and right end points are chosen as z~ = z + = 0. A = 2~ 10 . 
Y t n G R 10 /( 2A ) +1 is the I th level of the parallel marginalization Markov 
chain at algorithmic time n. There are 10 chains (L = 9). The observed 
swap acceptance rates are reported in table Q. Notice that the swap rates 
are highest at the lower levels but seems to stabilize at the higher levels. 

Table 1. Swap acceptance rates for bridge sampling problem 



Levels 1 0/1 


1/2 


2/3 


3/4 


4/5 


5/6 


6/7 


7/8 


8/9 


0.86 


0.83 


0.75 


0.69 


0.54 


0.45 


0.30 


0.22 


0.26 


1 Swaps between 


levels 


I and 


' + 1. 













Let Y^ id € M. denote the midpoint of the path defined by Yq (i.e. an 
approximate sample of the path at time 5). In Figure IT] the autocorrelation 



imsart-aos ver. 2007/04/13 file: weare07cv2.tex date: February 5, 2008 



21 



nf V n 
01 r mid 



Corr Ymid^mid 



is compared to that of a standard Metropolis-Hastings rule using 1 dimen- 
sional Gaussian random walk proposals. In the figure, the time scale of the 
autocorrelation for the Metropolis-Hastings method has been scaled by a 
factor of 1/10 to more than account for the extra computational time re- 
quired per iteration of parallel marginalization. The relaxation time of the 
parallel chain is clearly reduced. 



0.8 



0.4 



0.2 



— parallel marginalization 
Metropolis-Hastings 



Fig 1. Autocorrelation of Y^ id for Metropolis-Hastings method with 1-d Gaussian ran- 
dom walk proposals (solid) and parallel marginalization (dotted). The x-axis runs from to 
10000 iterations of the Metropolis-Hastings method and from to 1000 iterations of par- 
allel marginalization. This reseating more than compensates for the extra work for parallel 
marginalization per iteration. 



In these numerical examples, parallel marginalization is applied with a 
slight simplification as detailed in the following algorithm. 

Algorithm 3. The chain moves from Y n to Y n+1 as follows: 

1. Generate M independent Gaussian random paths {C m ( n )} ne g with 
independent components £ m (n) of mean and variance 2 i_1 A. 

2. For each j and n G Si let 

jjm (n) = C m (n) + Q 5 (xm(n _ 1} + Xi+i{n + 1}) _ 
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3. Define the weights 

= 7T; (x l+1 ,U m ) 

u Pl {U™\x l+1 )' 
where p\ is defined by the choice in step 1 as 

I 



pi (xi\xi) oc exp 



(xi(n) - 0.5 {xi{n - 1) + xi{n + l))) 2 
\nes, 



4- Choose J G {1, . . . , M} according to the probabilities 

p[j=j]= 5 > • 

Z^fc=l ^t/ 

5et ?' = U J . 
5. Set V J = xi and for j ^ J set 

V j (n) = ( j (n) + 0.5 (xj(n - 1) + x ? (n + 1)) . 

5. Define the weights 

wm = n{x u V m ) 
V Pi{V m \xi)' 

7. Set 

Y n+1 = (...,xi +1 ,yi,x h ...) 

with probability 
and 

Y n+1 = Y n = (..., x h x h x l+1 ,...) 
with probability 1 — Aj . 

This simplification reduces by half the number of gaussian random vari- 
ables needed to evaluate the acceptance probability but may not be appro- 
priate in all settings. For this problem, the choice of M in (13), the number 
of samples U m and V m , seems to have little effect on the swap acceptance 
rates. In the numerical experiment M = I + 1 for swaps between levels / and 
l + l. 

The results of the Metropols-Hastings and parallel marginalization meth- 
ods applied to the above bridge sampling problem after a run time of 10 
minutes on a standard workstation are presented in Figures [2] and [3j Ap- 
parently the sample generated by parallel marginalization is a reasonable 
bridgepath while the Metropolis-Hastings method has clearly not converged. 
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Fig 2. Metropolis generated bridge path from Section 3.1 after a 10 minute run on a 
standard desktop workstation. Clearly the method has not converged. 

3.2. Non-linear smoothing /filtering. In the non-linear smoothing and fil- 
tering problem one seeks to approximate conditional expectations of the 
form 



where the real valued processes Z(t) and H{j) are given by the system 



g, f, a, and r are real valued functions of M. The xU) are rea l valued 
independent random variable drawn from the density ft and are independent 
of the Brownian motion W(t). {sj} C {tj} , and = so < si < ... < sk = T. 
The process Z{t) is a hidden signal and the H{j) are noisy observations. 
The idea of computing the above conditional expectation by conditional 
path sampling has been suggested in |18[ [19]. Popular alternatives include 
particle filters (see [5D]) and ensemble Kalman filters (see |21|). 




Again, begin by discretizing the system. Assume that Nj = N (s J+ i — Sj) — 
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K 




(14) 



dZ(t) = f (Z(t)) dt + a (Z(t)) dW(t) 

H(j) = r(Z( Sj )) + X (j), 
Z(0) ~ p, x(j) ~ i-i-d. ix. 
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Fig 3. Parallel marginalization generated bridge path from Section 3.1 after a 10 minute 
run on a standard desktop workstation. Apparently the method has converged. 

1 is an integer and let A = ^. The linearly implicit Euler scheme gives 

X(j,n+l)=X(j,n)+f(X(j,n))A 

+ (X(j, n + 1) - X(j, n)) f (X(j, n))A + a (X(j, n)) n), 
H{j) = r{X{j))+ x {j), 
X(0) = Z(0) xU) ~ i-i-d. V 

where X(j,n) represents the discrete time approximation to Z(sj + nA), 
for < n < Nj. The £(n) are independent Gaussian random variables with 
mean and variance 1. The £(n) are independent of the x m • N is again 
assumed to be a power of 2. 

The approximate path measure for this problem is 

7T (xo(0),...,xo(A f ) \h(0),...,h(K)) 

oc exp I - ^2 V(x (n),x (n + 1), A) J 

K 

xp(x (0))l[fi(xo(j)-r(h(j))) 

n=0 
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The approximate marginals are chosen as 
TTj ({x,(n)} neSi \h(0),...,h(K)) oc 

K 

Qi ({siHWj p(^(o)) II M(^(i) - r 

n=0 

where V, and 5; are as defined in the previous section. 

In this example, samples of the smoothed path are generated between 
time time and time 10 for the same diffusion in a double well potential. 
The densities \x and p are chosen as 

H = N(0, 0.01) and p(x) oc exp (- (x 2 - l)^ 

The function r in ( |14[ ) is the identity function. The observation times are 
so = 0, si = 1, . . . , sio = 10 with H(j) = —1 for j = 0, . . . , 5 and H(j) = 1 
for j = 6, ... ,10. A = 2~ 10 . There are 8 chains (L = 7). The observed swap 
acceptance rates are reported in table ([2]). Notice that the swap rates are 
again highest at the lower levels but, for this problem, become unreasonably 
small at the highest level. 



Table 2. Swap acceptance rates for filtering/smoothing problems 



Levels 1 


0/1 


1/2 


2/3 


3/4 


4/5 


5/6 


6/7 




0.86 


0.83 


0.74 


0.65 


0.46 


0.23 


0.04 



1 Swaps between levels I and I + 1. 



Again, Y£ id E R denotes the midpoint of the path defined by Yq 1 (i.e. an 
approximate sample of the path at time 5). In Figure [4] the autocorrelation 
of Y^ id is compared to that of a standard Metropolis-Hastings rule. The 
figure has been adjusted as in the previous example. The relaxation time of 
the parallel chain is again clearly reduced. 

The algorithm is modified as in the previous example. For this problem, 
acceptable swap rates require a higher choice of M in ( 13 ) than needed in the 
bridge sampling problem. In this numerical experiment M = 2 l for swaps 
between levels I and 1 + 1. 

The results of the Metropols-Hastings and parallel marginalization meth- 
ods applied to the smoothing problem above after a run time of 10 min- 
utes on a standard workstation are presented in figure [5] and [6} Apparently 
the sample generated by parallel marginalization is a reasonable bridgepath 
while the Metropolis-Hastings method has clearly not converged. 



imsart-aos ver. 2007/04/13 file: weare07cv2.tex date: February 5, 2008 



2G 



J. WE ARE 




Fig 4. Autocorrelation of Y^ id for Metropolis-Hastings method with 1-d Gaussian ran- 
dom walk proposals (solid) and parallel marginalization (dotted). The x-axis runs from to 
10000 iterations of the Metropolis-Hastings method and from to 1000 iterations of par- 
allel marginalization. This reseating more than compensates for the extra work for parallel 
marginalization per iteration. 

4. Conclusion. A Markov chain Monte Carlo method has been pro- 
posed and applied to two conditional path sampling problems for stochastic 
differential equations. Numerical results indicate that this method, parallel 
marginalization, can have a dramatically reduced equilibration time when 
compared to standard MCMC methods. 

Note that parallel marginalization should not be viewed as a stand alone 
method. Other acceleration techniques such as hybrid Monte Carlo can and 
should be implemented at each level within the parallel marginalization 
framework. As the smoothing problem indicates, the acceptance probabili- 
ties at coarser levels can become small. The remedy for this is the develop- 
ment of more accurate approximate marginal distributions by, for example, 
the methods in [6] and |S]. 
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Fig 5. Metropolis-Hastings generated smoothed path from Section 3.2 after a 10 minute 
run on a standard desktop workstation. Clearly the method has not converged. 
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Fig 6. Parallel marginalization generated smoothed path from Section 3.2 after a 10 
minute run on a standard desktop workstation. Apparently the method has converged. 
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